What is Linear Regression?
Linear regression is a parametric technique used to predict a continuous (dependent) variable given a set of independent variables.
$ Y = \beta_0 + \beta_1 X + \epsilon $
1. Y - the variable we predict (the dependent variable)
2. X - the variable we use to make the prediction (the independent variable)
3. β0 - the intercept term, i.e. the predicted value of Y when X = 0
4. β1 - the slope term; it gives the change in Y when X changes by 1 unit
5. ε - the error (residual) term, i.e. the difference between the actual and predicted values
The β coefficients are estimated by minimising an error criterion. Common estimation techniques include:
a. Ordinary Least Squares (OLS) - minimises ∑[Actual(y) - Predicted(y')]²
b. Generalized Least Squares
c. Percentage Least Squares
d. Total Least Squares
e. Least Absolute Deviation
OLS is the most widely used of these because:
i. It uses squared error, which has nice mathematical properties, making it easy to differentiate and to apply gradient descent
ii. OLS is easy to analyse and computationally fast, i.e. it can be quickly applied to data sets with thousands of features
iii. Interpretation of OLS results is much easier than that of other regression techniques
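Since OLS minimises the squared error, the β coefficients have a closed-form solution. Below is a minimal sketch on toy data (an illustrative addition, not part of the case study that follows) that solves the normal equation with NumPy; the variable names are made up for the example.
In [ ]:
# Toy illustration of OLS: solve the normal equation beta = (X'X)^-1 X'y
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y_toy = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)   # true intercept 3, true slope 2

X_design = np.column_stack([np.ones_like(x), x])     # prepend a column of 1s for the intercept
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_toy)
print('Estimated intercept and slope:', beta_hat)    # should be close to (3, 2)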
In [4]:
# Case Study : Predicting Housing Price
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [87]:
# Importing the Boston housing data
# Note: load_boston is deprecated in recent scikit-learn releases (removed in 1.2);
# this notebook assumes an older version where it is still available
from sklearn.datasets import load_boston
boston = load_boston()
boston.keys()
boston.feature_names
Out[87]: (array of the 13 feature names)
In [37]:
X = boston.data
y = boston.target
features= pd.DataFrame(X,columns=boston.feature_names)
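Before checking the assumptions, a quick look at the data (an assumed addition, not in the original notebook): the Boston dataset has 506 observations of 13 numeric features, and the target is the median home value in thousands of dollars.
In [ ]:
# Quick peek at the feature matrix (506 rows x 13 columns)
print(features.shape)
features.head()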
Linear regression makes the following assumptions:
1. There exists a linear and additive relationship between the dependent variable (DV) and the independent variables (IVs)
2. No multicollinearity, i.e. absence of correlation between the independent variables
3. Homoskedasticity, i.e. constant variance in the error terms
4. No autocorrelation, i.e. absence of correlation among the error terms
5. The dependent variable and the error terms should follow a normal distribution
In [85]:
# For simplicity, we use only numeric variables (all features in the Boston dataset are numeric)
# Validating assumption 2: no multicollinearity (absence of correlation between independent variables)
# sns.pairplot(features)   # pairwise scatter plots; informative but slow for many features
sns.heatmap(features.corr(), annot=True)
# Findings: 1. For simplicity, we assume multicollinearity is not a serious problem here
# Variance Inflation Factor (VIF) - this metric is used to check multicollinearity.
# VIF <= 4 implies no multicollinearity, while VIF >= 10 suggests high multicollinearity.
# Alternatively, you can look at the tolerance (1/VIF) value to determine correlation among IVs.
# In addition, you can create a correlation matrix to identify collinear variables.
Out[85]: (annotated correlation heatmap of the features)
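The VIF mentioned above can also be computed directly. The sketch below is an assumed addition (it relies on statsmodels being installed) and is not part of the original notebook:
In [ ]:
# Compute the Variance Inflation Factor for each feature to complement the heatmap
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X_vif = add_constant(features)            # VIF calculations need an intercept column
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.drop('const').sort_values(ascending=False))   # VIF >= 10 suggests high multicollinearity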
In [43]:
# Validating assumption 5
# The dependent variable and the error terms should follow a normal distribution
sns.distplot(y)   # distplot is deprecated in newer seaborn; histplot/displot are the replacements
# Findings: 1. The dependent variable looks roughly normally distributed
Out[43]: (distribution plot of the target variable)
There is little you can do when your data violates regression assumptions. An obvious solution is to use tree-based algorithms, which capture non-linearity quite well. But if you are set on using regression, the following are some tips you can apply:
1. If your data suffers from non-linearity, transform the IVs using sqrt, log, square, etc.
2. If your data suffers from heteroskedasticity, transform the DV using sqrt, log, square, etc. You can also use the weighted least squares method to tackle this problem.
3. If your data suffers from multicollinearity, use a correlation matrix to check for correlated variables. Let's say variables A and B are highly correlated. Instead of removing one of them arbitrarily, find the average correlation of A and of B with the rest of the variables, and remove whichever has the higher average. Alternatively, you can use penalized regression methods such as lasso, ridge, or elastic net (see the sketch after this list).
4. You can do variable selection based on p-values. If a variable shows a p-value > 0.05, we can remove it from the model, since at p > 0.05 we fail to reject the null hypothesis that its coefficient is zero.
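As a sketch of the penalized-regression alternative mentioned in tip 3 (an assumed addition, not the author's code), Ridge and Lasso can be compared with cross-validation; the alpha values below are illustrative defaults:
In [ ]:
# Penalized regression shrinks correlated coefficients instead of dropping variables
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

for name, model in [('Ridge', Ridge(alpha=1.0)), ('Lasso', Lasso(alpha=0.1))]:
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(name, 'mean cross-validated R2:', round(scores.mean(), 3))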
In [92]:
# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2)
print(len(X_train))
print(len(X_test))
print(len(y_train))
print(len(y_test))
In [93]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
fit = lm.fit(X_train, y_train)
print('% of response variable variation explained by the linear model (training R2):', fit.score(X_train, y_train) * 100)
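Optionally, we can inspect the fitted parameters (an assumed addition, not in the original notebook): the intercept corresponds to β0 and each coefficient gives the change in predicted price per unit change in that feature, holding the others fixed.
In [ ]:
# Intercept (beta0) and per-feature coefficients of the fitted model
print('Intercept:', lm.intercept_)
print(pd.Series(lm.coef_, index=boston.feature_names).sort_values())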
In [ ]:
'''
import statsmodels.api as sm
X_train_sm = sm.add_constant(X_train)             # statsmodels' OLS needs an explicit intercept column
model = sm.OLS(y_train, X_train_sm).fit()
predictions = model.predict(sm.add_constant(X_test))
# Print out the statistics
model.summary()
# 1. Residual vs. fitted values plot to check heteroskedasticity - if any pattern exists, heteroskedasticity is present; otherwise it is not
# 2. Breusch-Pagan / Cook-Weisberg test - this test is used to determine the presence of heteroskedasticity.
#    If you find p < 0.05, you reject the null hypothesis and infer that heteroskedasticity is present.
# 3. Durbin-Watson statistic (DW) - this test is used to check autocorrelation. Its value lies between 0 and 4.
#    DW = 2 indicates no autocorrelation; 0 < DW < 2 implies positive autocorrelation, while 2 < DW < 4 implies negative autocorrelation.
# Findings: we do not find autocorrelation
'''
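The two diagnostics described above can also be run directly on the sklearn model's training residuals. This is a hedged sketch (it assumes statsmodels is installed) rather than the author's original code:
In [ ]:
# Breusch-Pagan test for heteroskedasticity and Durbin-Watson statistic for autocorrelation
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

residuals = y_train - lm.predict(X_train)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(residuals, sm.add_constant(X_train))
print('Breusch-Pagan p-value:', bp_pvalue)                    # p < 0.05 would indicate heteroskedasticity
print('Durbin-Watson statistic:', durbin_watson(residuals))   # a value near 2 indicates no autocorrelation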
In [96]:
## Predictions
from sklearn import metrics
predictions = lm.predict(X_test)
print('Test R2 (how well the model predicts responses for new observations):', metrics.r2_score(y_test, predictions) * 100)
In [97]:
# Scatter plot to see how well the model predicts responses for new observations
plt.scatter(y_test, predictions)
plt.xlabel('Actual values (y_test)')
plt.ylabel('Predicted values')
Out[97]: (scatter plot of actual vs. predicted values)
In [98]:
# Checking whether the error terms (residuals) follow a normal distribution
sns.distplot(y_test - predictions)   # distplot is deprecated in newer seaborn; histplot/displot are the replacements
# Findings: Yes, the error terms look approximately normally distributed
Out[98]: (distribution plot of the residuals)
In [99]:
from sklearn import metrics
print('MAE = ', metrics.mean_absolute_error(y_test,predictions))
print('MSE = ',metrics.mean_squared_error(y_test,predictions))
print('RMSE = ',np.sqrt(metrics.mean_squared_error(y_test,predictions)))
# Error metrics are the crucial evaluation numbers to check. Since these are all errors, the lower the value, the better the model.
# MAE and RMSE are in the same units as the target (thousands of dollars).
# Findings: Based on these metrics, the model performs reasonably well.
Great, we have completed it! Are there any ways to improve the model?
In [ ]: